TripleTen - Sprint_06_Development_Tools_Project - Dan Lieberman¶

Contents ¶

  • Introduction

  • Stage 1. Data overview

    • Conclusions
  • Stage 2. Data preprocessing

    • 2.1 Missing values
    • 2.2 Removing records with NaN model year
    • 2.3 The cylinder field
    • 2.4 Duplicates
    • Conclusions
  • Stage 3. Exploratory Data Analysis - EDA

    • 3.1 Cylinders column
      • Conclusions
    • 3.2 Manufacturer of 4WD cars by Year - Histograms
      • Conclusions
    • 3.3 Price Vs Model Year - Scatter Plot
      • Conclusions
  • Stage 4. Develop a Web Application, and Deploy it to Render

  • Final Conclusion

Introduction ¶

This project is designed to give us more practice with common software engineering tasks, which will help us enhance and complement our data skills.

We were given the challenge of not only completing typical analysis tasks but also developing and deploying a web application to a cloud service for public access.

To support us in this endeavor, we were given a dataset containing car sales advertisements.

Goal:¶

The project is based on 4 main steps:

  1. Examine the methods for filling missing values in the 'cylinders' field.
  2. Create a couple of histograms and scatterplots
  3. Develop a web application dashboard.
  4. Deploy the application to Render.

Stages¶

Data on car sales advertisements is stored in the file vehicles_us.csv.

First, we'll create the following accounts for the project's Virtual Environment: GitHub, Render, VS Code, along side with the necessary packages installation.

Second, we'll Download the data file.

Third, we'll evaluate the quality of the data and see whether its issues are significant.

--

The project will consist of the following stages:

  1. Data overview.
  2. Data preprocessing.
  3. Exploratory Data Analysis - EDA.
  4. Develop a web application, and Deploy it to Render.

Back to Contents

Stage 1. Data overview ¶

In [1]:
import pandas as pd
df = pd.read_csv('../vehicles_us.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB
In [2]:
df.head(10)
Out[2]:
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0 automatic SUV NaN 1.0 2018-06-23 19
1 25500 NaN ford f-150 good 6.0 gas 88705.0 automatic pickup white 1.0 2018-10-19 50
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red NaN 2019-02-07 79
3 1500 2003.0 ford f-150 fair 8.0 gas NaN automatic pickup NaN NaN 2019-03-22 9
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black NaN 2019-04-02 28
5 14990 2014.0 chrysler 300 excellent 6.0 gas 57954.0 automatic sedan black 1.0 2018-06-20 15
6 12990 2015.0 toyota camry excellent 4.0 gas 79212.0 automatic sedan white NaN 2018-12-27 73
7 15990 2013.0 honda pilot excellent 6.0 gas 109473.0 automatic SUV black 1.0 2019-01-07 68
8 11500 2012.0 kia sorento excellent 4.0 gas 104174.0 automatic SUV NaN 1.0 2018-07-16 19
9 9200 2008.0 honda pilot excellent NaN gas 147191.0 automatic SUV blue 1.0 2019-02-15 17
In [3]:
df.tail(5)
Out[3]:
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
51520 9249 2013.0 nissan maxima like new 6.0 gas 88136.0 automatic sedan black NaN 2018-10-03 37
51521 2700 2002.0 honda civic salvage 4.0 gas 181500.0 automatic sedan white NaN 2018-11-14 22
51522 3950 2009.0 hyundai sonata excellent 4.0 gas 128000.0 automatic sedan blue NaN 2018-11-15 32
51523 7455 2013.0 toyota corolla good 4.0 gas 139573.0 automatic sedan black NaN 2018-07-02 71
51524 6300 2014.0 nissan altima good 4.0 gas NaN automatic sedan NaN NaN 2018-06-05 10
In [4]:
df.isnull().sum()
Out[4]:
price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

conclusion ¶

In this stage, we conducted an initial overview of the car sales advertisements dataset, which contains 51,525 entries and 13 columns. Key findings include:

Data Structure: The dataset includes attributes like price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, and days_listed.

Missing Values: Several columns have missing values: model_year (3,619), cylinders (5,260), odometer (7,892), paint_color (9,267), and is_4wd (25,953).

Data Types: The data types are appropriate for each column

Back to Contents

Stage 2. Data preprocessing ¶

2.1 Missing values ¶

In [5]:
df['paint_color'] = df['paint_color'].fillna('unknown')
df['is_4wd'] = df['is_4wd'].fillna(0)
df['is_4wd'] = df['is_4wd'] == 1
df['odometer'] = df['odometer'].fillna(0)

2.2 Removing records with NaN model year ¶

Although the model year is a crucial data point in advertising, yet some records lack this information.

Instead of imputing missing model years, I chose to remove these records to maintain the integrity of my analysis.

However, further analysis and accuracy could benefit from additional data containing this information.

In [6]:
# Remove the records of the cars with model_year is Nan
df = df[df['model_year'].notna()]
# Changing the model_year type to int
df['model_year'] = df['model_year'].astype('int')
In [7]:
df.isnull().sum()
Out[7]:
price              0
model_year         0
model              0
condition          0
cylinders       4897
fuel               0
odometer           0
transmission       0
type               0
paint_color        0
is_4wd             0
date_posted        0
days_listed        0
dtype: int64

Back to Contents

2.3 The cylinder field ¶

As evident from the list of missing values, we still have gaps in the 'cylinders' field.

We plan to address this issue later, as it requires careful consideration and may impact the accuracy of our analysis.

2.4 Duplicates ¶

Obvious Duplicates¶

In [8]:
# counting clear duplicates
print('Number of overall duplicates:', df.duplicated().sum())
Number of overall duplicates: 0

Implicit Duplicates¶

In [9]:
def print_sorted_unique_values(df, columns):
    for column in columns:
        unique_values = df[column].astype(str).unique()
        unique_values.sort()
        print(f"Unique values for {column}: {unique_values}")

# columns to search in
columns_to_check = ['condition', 'fuel', 'transmission', 'type', 'paint_color']

# Call the function
print_sorted_unique_values(df, columns_to_check)
Unique values for condition: ['excellent' 'fair' 'good' 'like new' 'new' 'salvage']
Unique values for fuel: ['diesel' 'electric' 'gas' 'hybrid' 'other']
Unique values for transmission: ['automatic' 'manual' 'other']
Unique values for type: ['SUV' 'bus' 'convertible' 'coupe' 'hatchback' 'mini-van' 'offroad'
 'other' 'pickup' 'sedan' 'truck' 'van' 'wagon']
Unique values for paint_color: ['black' 'blue' 'brown' 'custom' 'green' 'grey' 'orange' 'purple' 'red'
 'silver' 'unknown' 'white' 'yellow']

conclusion ¶

In conclusion, the data preprocessing stage involved several key steps to prepare the dataset for analysis:

Handling Missing Values:¶

Missing values in the 'paint_color', 'is_4wd', and 'odometer' columns were filled or replaced appropriately. Records with missing model years were removed from the dataset to maintain data integrity.

Cylinder Field:¶

The 'cylinders' field still contains missing values, which will be addressed in subsequent stages due to its potential impact on analysis accuracy.

Duplicates:¶

Both obvious and implicit duplicates were checked and appropriately handled.

Further analysis and model development will benefit from these preprocessing steps, ensuring that the dataset is ready for detailed exploration and modeling.

Back to Contents

Stage 3. Exploratory Data Analysis - EDA ¶

3.1 Cylinders column ¶

In the following section, we will delve into the 'cylinders' column.

While this examination may not align with traditional exploratory data analysis (EDA) procedures, I have chosen to include it in this section due to its relevance in understanding the dataset.

This analysis has also led to valuable insights for further testing and analysis

In [10]:
df['cylinders'].unique()
Out[10]:
array([ 6.,  4.,  8., nan,  5., 10.,  3., 12.])
In [11]:
# Group the data by 'model' and 'model_year'
grouped = df.groupby(['model', 'model_year'])

# Calculate the median number of cylinders for each group
median_cylinders = grouped['cylinders'].median()
median_cylinders.unique()
Out[11]:
array([ 6. ,  8. ,  7. ,  nan,  4. ,  5. ,  4.5, 10. ,  9. ,  5.5])

Upon reviewing the results, it appears that using the median function may not be the most appropriate approach.

The issue arises when there is an even number of candidates, as the median function returns their mean value.

A potentially better alternative is to determine the most common value in the group, which aligns more closely with the intended purpose of the median function but with greater accuracy, given the size of the dataset.

Please note that I used 'groupby' and 'median' functions to calculate the median number of cylinders for each group.

In [12]:
grouped = df.groupby(['model', 'model_year'])
grouped['cylinders'].unique()
Out[12]:
model              model_year
acura tl           1999               [6.0, nan]
                   2001                    [6.0]
                   2002                    [6.0]
                   2003               [6.0, nan]
                   2004          [6.0, 4.0, nan]
                                      ...       
volkswagen passat  2014          [4.0, nan, 5.0]
                   2015          [4.0, nan, 5.0]
                   2016          [4.0, nan, 6.0]
                   2017               [4.0, nan]
                   2018                    [4.0]
Name: cylinders, Length: 2226, dtype: object

Back to Contents

Using of the 'Most Common' approach¶

In [13]:
# Group the data by 'model' and 'model_year'
grouped = df.groupby(['model', 'model_year'])

# Calculate the most common number of cylinders for each group
most_common_cylinders = grouped['cylinders'].apply(lambda x: x.mode()[0] if len(x.mode()) > 0 else None)
most_common_cylinders.unique()
Out[13]:
array([ 6.,  8., nan,  4.,  5., 10.])

It's evident that the numbers 3 and 12 are missing from the new list. This discrepancy may be attributed to errors in the records corresponding to these values, although it could also be related to sub-model issues.

Therefore, I opted to correct only the NaN values and not the records associated with 3 and 12.

Additionally, some records have missing values for the 'cylinders' field, indicated by 'None'.

In [14]:
# Fill NaN values in 'cylinders' column with the calculated must common values
common_cylinders_series = df.set_index(['model', 'model_year']).index.map(most_common_cylinders.get)
df['cylinders'] = df['cylinders'].fillna(pd.Series(common_cylinders_series, index=df.index))
In [15]:
df['cylinders'].unique()
Out[15]:
array([ 6.,  4.,  8.,  5., 10., nan,  3., 12.])

Back to Contents

Reviewing the NaNs¶

In [16]:
df[df['cylinders'].isnull()].sort_values(["model","model_year"])
Out[16]:
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
11087 16000 1971 chevrolet camaro excellent NaN gas 0.0 manual coupe brown False 2018-12-28 125
28799 45900 1971 chevrolet camaro like new NaN gas 0.0 manual coupe orange False 2019-01-10 75
37000 6800 1980 chevrolet malibu good NaN gas 0.0 automatic coupe brown False 2019-02-26 33
3914 9750 2007 chevrolet silverado 1500 crew excellent NaN gas 133000.0 automatic pickup grey False 2018-08-22 55
36582 44900 1949 chevrolet suburban good NaN gas 1800.0 automatic wagon orange False 2018-08-19 10
6982 3900 1977 chevrolet suburban fair NaN gas 0.0 automatic SUV custom False 2019-02-02 71
15623 5300 2005 dodge charger excellent NaN gas 125000.0 automatic sedan red False 2018-10-30 46
33965 5800 1971 ford econoline good NaN gas 78000.0 automatic van blue False 2019-04-01 40
14752 15000 1954 ford f-150 excellent NaN gas 3565.0 manual pickup black False 2019-02-16 13
1101 9200 1975 ford f-150 excellent NaN gas 0.0 automatic truck green False 2018-08-28 40
15948 14999 2007 ford f-250 sd good NaN diesel 237726.0 automatic truck custom True 2018-11-21 46
37146 2000 1994 ford f-250 super duty fair NaN diesel 224000.0 automatic truck white False 2018-07-13 11
3139 3900 2003 ford f-350 sd good NaN gas 0.0 automatic truck black True 2018-06-20 59
6932 4000 1993 ford f150 excellent NaN gas 0.0 automatic truck unknown True 2019-01-28 18
5870 2500 2003 ford f150 supercrew cab xlt good NaN gas 169000.0 automatic truck green True 2019-01-17 18
7113 84900 2019 ford f350 new NaN diesel 9000.0 automatic pickup white True 2019-03-26 7
47761 3800 1984 honda accord new NaN gas 121000.0 manual sedan blue False 2019-01-05 162
7121 6900 2009 jeep cherokee excellent NaN gas 130023.0 automatic SUV black True 2018-09-03 19
43595 5995 1988 jeep wrangler good NaN gas 100309.0 manual SUV white True 2018-08-17 12
36935 3995 2009 kia sorento excellent NaN gas 211698.0 automatic SUV grey True 2018-10-08 56
3844 1900 1986 nissan sentra excellent NaN gas 0.0 manual coupe blue False 2018-10-29 63
35818 1700 1986 nissan sentra excellent NaN gas 152000.0 manual coupe blue False 2018-09-29 23
27075 3800 1988 ram 2500 good NaN gas 64000.0 automatic pickup unknown True 2018-08-26 34
29822 1800 1989 toyota camry excellent NaN gas 196617.0 automatic sedan blue False 2019-03-01 32
39952 1900 1990 toyota camry good NaN gas 156481.0 automatic sedan unknown False 2018-09-29 43
12760 1050 1995 toyota corolla fair NaN gas 428000.0 manual sedan black False 2018-06-13 12
In [17]:
df['cylinders'].isnull().sum() 
Out[17]:
26
In [18]:
# checking for a unique NaN
df1 = df[df["model"] == "toyota camry"]
df1[df1["model_year"] == 1989]
Out[18]:
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
29822 1800 1989 toyota camry excellent NaN gas 196617.0 automatic sedan blue False 2019-03-01 32

To conduct a more precise examination of the NaN values, we could have compared them against a range of years from the same model.

However, given that the number of NaN values is small relative to the overall total, I chose not to focus extensively on this aspect.

Back to Contents

Checking the 'Cylinders' data after filling NaN values¶

In [19]:
df.head(15)
Out[19]:
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011 bmw x5 good 6.0 gas 145000.0 automatic SUV unknown True 2018-06-23 19
2 5500 2013 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red False 2019-02-07 79
3 1500 2003 ford f-150 fair 8.0 gas 0.0 automatic pickup unknown False 2019-03-22 9
4 14900 2017 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black False 2019-04-02 28
5 14990 2014 chrysler 300 excellent 6.0 gas 57954.0 automatic sedan black True 2018-06-20 15
6 12990 2015 toyota camry excellent 4.0 gas 79212.0 automatic sedan white False 2018-12-27 73
7 15990 2013 honda pilot excellent 6.0 gas 109473.0 automatic SUV black True 2019-01-07 68
8 11500 2012 kia sorento excellent 4.0 gas 104174.0 automatic SUV unknown True 2018-07-16 19
9 9200 2008 honda pilot excellent 6.0 gas 147191.0 automatic SUV blue True 2019-02-15 17
10 19500 2011 chevrolet silverado 1500 excellent 8.0 gas 128413.0 automatic pickup black True 2018-09-17 38
11 8990 2012 honda accord excellent 4.0 gas 111142.0 automatic sedan grey False 2019-03-28 29
12 18990 2012 ram 1500 excellent 8.0 gas 140742.0 automatic pickup unknown True 2019-04-02 37
13 16500 2018 hyundai sonata excellent 4.0 gas 22104.0 automatic sedan silver False 2019-01-14 29
14 12990 2009 gmc yukon excellent 8.0 gas 132285.0 automatic SUV black True 2019-01-31 24
15 17990 2013 ram 1500 excellent 8.0 gas 0.0 automatic pickup red True 2018-05-15 111
In [20]:
df['cylinders'].unique()
Out[20]:
array([ 6.,  4.,  8.,  5., 10., nan,  3., 12.])
In [21]:
cylinders_counts = df['cylinders'].value_counts()

print("Count of each unique value in 'cylinders' column:")
print(cylinders_counts)
Count of each unique value in 'cylinders' column:
cylinders
8.0     16339
6.0     16330
4.0     14368
10.0      538
5.0       270
3.0        33
12.0        2
Name: count, dtype: int64

Conclusion ¶

The analysis of the 'cylinders' column revealed several key findings.

Firstly, the dataset contains records where the 'cylinders' field is missing or marked as 'None', indicating a data quality issue.

Secondly, the use of the median function to impute missing values may not be ideal, as it can introduce inaccuracies when dealing with an even number of candidates.

Instead, considering the most common value in each group could provide a more accurate representation of the data.

Finally, the absence of the numbers 3 and 12 in the updated list warrants further investigation, as it may indicate errors or sub-model issues in the dataset.

Back to Contents

3.2 Manufacturer of 4WD cars by Year - Histograms ¶

This section examines the distribution of 4WD cars manufactured over the years using histograms.

We add a 'manufacturer' column to the dataset, assuming the first word of the model name indicates the manufacturer.

The analysis focuses on 4WD cars from 2009 onwards, showing the number of cars per manufacturer per year.

We also look specifically at 4WD cars in 2018, filtering for models with at least 20 cars, to visualize the distribution per model.

In [22]:
# Adding the manufacturer column. Assuiming the first word of the model is that.
df['manufacturer'] = df['model'].str.split(' ').str[0]
In [23]:
import plotly.express as px
In [24]:
year_df = df[df['model_year'] >= 2009]
df_count = year_df[year_df['is_4wd']].groupby(['manufacturer', 'model_year']).size().reset_index(name='Number of Cars')
fig = px.bar(df_count, x='model_year', y='Number of Cars', color='manufacturer', barmode='group', title='Amount of 4WD Cars per Year per Manufacturer')
fig.update_layout(xaxis_title="Model Year", xaxis={'tickvals': df_count['model_year']})
fig.show()

The histogram above indicates a notable shift in 4WD vehicle production in 2018.

Ford, which had been the dominant manufacturer from 2010 to 2018, was surpassed by Chevy and lost its leading position.

Back to Contents

The shift in 4WD vehicle production in 2018, with Chevy surpassing Ford, prompted a closer examination of the models produced that year.

The histogram below showcases these models.

In [25]:
year_df = df[df['model_year'] == 2018]
df_count = year_df[year_df['is_4wd']].groupby(['model', 'model_year']).size().reset_index(name='Number of cars')
df_count = df_count[df_count['Number of cars'] >= 20]
fig = px.bar(df_count, x='model_year', y='Number of cars', color='model', barmode='group', title='Amount of 4WD Cars in 2018 per Model (filtered for >= 20)')
fig.show()

Conclusion ¶

The analysis of 4WD vehicle production revealed interesting trends.

The histogram showed a significant change in 2018, with Chevrolet overtaking Ford as the leading manufacturer.

This shift prompted a focused examination of the models produced in 2018, highlighting the dynamic nature of the automotive industry.

The findings underscore the importance of monitoring production trends to understand market dynamics and consumer preferences.

Back to Contents

3.3 Price Vs Model Year - Scatter Plot ¶

In this section, we investigate the correlation between vehicle price and model year through scatter plots.

This analysis aims to uncover trends or patterns in price fluctuations across different model years, providing insights into vehicle depreciation rates and market values over time.

Additionally, we aim to identify any anomalies in these patterns.

In [26]:
df_filtered = df[df['price'] >= 100]
fig = px.scatter(df_filtered, x='model_year', y='price', 
                 hover_name='model', title='Price vs Model Year',
                 labels={'model_year': 'Model Year', 'price': 'Price'})

# Update layout
fig.update_layout(showlegend=False)

# Show the chart
fig.show()

Conclusion ¶

The scatterplot reveals a notable upward trend in prices from 1996 onwards, indicating that newer vehicles tend to have higher prices. However, between the late 1970s and 1990s, the graph shows relative price stability, suggesting the presence of influential factors that require further analysis.

One of the most intriguing observations in the scatter plot is the presence of a dense cloud of data points, indicating inconsistency and numerous exceptions, particularly noticeable between the 1960s and the mid-late 1970s. This phenomenon is likely due to many cars from this era being preserved as collector's items, including 'muscle' cars that were discontinued around the time of the oil crisis. Prices of these vehicles have been well maintained and have even doubled, tripled, and enen more in value compared to relatively newer vehicles. This lack of a clear trend suggests that prices for these vehicles are highly variable, with sellers able to demand and receive prices based on the condition and rarity of the vehicle.

Back to Contents

Stage 4. Develop a Web Application, and Deploy it to Render ¶

Saving the fixed CSV file for the application.¶

In [27]:
df.to_csv('../vehicles_us_fixed.csv')

The final part of the project has been completed and is now being implemented in the app.py file and on the specified platforms.


Final Conclusions ¶

The analysis of the car sales advertisements dataset has provided valuable insights into the trends and characteristics of the vehicles listed.

Here are the key findings and conclusions from the analysis:

Data Overview: The dataset contains 51,525 entries and 13 columns, including attributes such as price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, and days_listed.

Missing Values: Several columns have missing values, including model_year, cylinders, odometer, paint_color, and is_4wd. These missing values were identified and may need to be addressed in further analysis to ensure the accuracy and completeness of the dataset.

Data Types: The data types for each column are appropriate, ensuring that the dataset is structured correctly for analysis.

Price vs. Model Year: A scatter plot analysis revealed an upward trend in prices from 1996 onwards, with newer vehicles generally commanding higher prices. However, between the late 1970s and 1990s, the graph showed relative price stability, indicating the presence of influencing factors that require further analysis.

Manufacturer of 4WD Cars by Year: Analysis of 4WD vehicle production showed a significant shift in 2018, with Chevrolet overtaking Ford as the leading manufacturer. This shift prompted a focused examination of the models produced in 2018, revealing interesting trends in vehicle production.

Overall, this analysis provides a comprehensive overview of the car sales advertisements dataset and highlights key trends and patterns in the data. Further analysis and exploration of the dataset could provide additional insights into the factors influencing vehicle prices, production trends, and market dynamics.

Back to Contents

In [ ]: